 optimality condition


Variational Regularized Unbalanced Optimal Transport: Single Network, Least Action

Neural Information Processing Systems

Recovering the dynamics from a few snapshots of a high-dimensional system is a challenging task in statistical physics and machine learning, with important applications in computational biology. Many algorithms have been developed to tackle this problem, based on frameworks such as optimal transport and the Schrödinger bridge. A notable recent framework is Regularized Unbalanced Optimal Transport (RUOT), which integrates both stochastic dynamics and unnormalized distributions. However, since many existing methods do not explicitly enforce optimality conditions, their solutions often fail to satisfy the principle of least action and struggle to converge stably and reliably. To address these issues, we propose Variational RUOT (Var-RUOT), a new framework for solving the RUOT problem. By incorporating the necessary optimality conditions for the RUOT problem into both the parameterization of the search space and the design of the loss function, Var-RUOT only needs to learn a scalar field to solve the RUOT problem and can search for solutions with lower action. We also examine the challenge of selecting a growth penalty function in the widely used Wasserstein-Fisher-Rao metric and propose a solution that better aligns with biological priors in Var-RUOT.


From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space

arXiv.org Machine Learning

We study the minimization of non-convex functionals over the Wasserstein space. While recent work has shown that perturbed Wasserstein gradient methods can avoid saddle points for benign landscapes, existing approaches remain essentially first-order and do not provide fast local convergence once the iterates enter a neighborhood of a global minimizer. We propose Wasserstein Saddle-Free Newton (WSFN), a second-order method that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian. This construction preserves attraction toward directions of positive curvature while inducing repulsion along directions of negative curvature, thereby overcoming the tendency of standard Wasserstein Newton dynamics to be attracted to saddles. We also establish second-order sufficient optimality conditions on Wasserstein space for strict local minimality. Under regularity and benign landscape assumptions, we prove that WSFN escapes saddle regions and reaches an $\alpha$-neighborhood of a global minimizer in polynomial time, with improved dependence on saddle parameters compared with prior perturbed first-order methods. Once inside this neighborhood, we show that WSFN converges linearly in $L^2$-Wasserstein distance to a non-degenerate global minimizer. Finally, we present a particle-based implementation of the method.
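
The preconditioning idea in this abstract can be illustrated in ordinary Euclidean space. The sketch below is not the Wasserstein method from the paper: it is a finite-dimensional analogue, assuming a hypothetical toy objective f(x, y) = x^2/2 + y^4/4 - y^2/2 with a saddle at the origin and minima at (0, ±1). The regularization constant `beta` and the starting point are also assumptions chosen for illustration.

```python
import numpy as np

# Toy objective (an assumption for illustration): f(x, y) = x^2/2 + y^4/4 - y^2/2.
# It has a saddle at (0, 0) and global minima at (0, +1) and (0, -1).

def grad(z):
    x, y = z
    return np.array([x, y**3 - y])

def hess(z):
    x, y = z
    return np.array([[1.0, 0.0], [0.0, 3.0 * y**2 - 1.0]])

def newton_step(z):
    # Plain Newton: treats every stationary point, saddles included, as a target.
    return z - np.linalg.solve(hess(z), grad(z))

def saddle_free_step(z, beta=1e-3):
    # Precondition by (H^2 + beta*I)^(-1/2): eigenvalues are replaced by
    # regularized absolute values, so negative curvature repels instead of attracts.
    w, V = np.linalg.eigh(hess(z))
    abs_w = np.sqrt(w**2 + beta)
    return z - V @ ((V.T @ grad(z)) / abs_w)

z_newton = np.array([0.5, 0.1])
z_sfn = np.array([0.5, 0.1])
for _ in range(50):
    z_newton = newton_step(z_newton)
    z_sfn = saddle_free_step(z_sfn)

print("Newton ends at     ", z_newton)  # drawn to the saddle at the origin
print("Saddle-free ends at", z_sfn)     # reaches a minimizer with |y| = 1
```

Starting near the saddle, the plain Newton iteration is pulled into it, while the iteration preconditioned by the absolute Hessian moves away along the negative-curvature direction and settles at a minimizer.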


Supplementary information for Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High Dimensions

Neural Information Processing Systems

This appendix presents the proof of the main technical result, Theorem 1. Throughout the whole proof, we assume that the set of conditions from Sec. 2 is satisfied. A.1 Required background. In this section, we give an overview of the main concepts and tools from approximate message passing algorithms that will be required for the proof. We start with some definitions that commonly appear in the approximate message-passing literature, see e.g. The main regularity class of functions we will use is that of pseudo-Lipschitz functions, which roughly amounts to functions with polynomially bounded first derivatives. We include the required scaling w.r.t. the dimensions in the definition for convenience. Since K will be kept finite, it can be absorbed in any of the constants. For example, the function $f: \mathbb{R}^n \to \mathbb{R}$, $x \mapsto \frac{1}{n}\|x\|_2^2$ is pseudo-Lipschitz of order 2. Moreau envelopes and Bregman proximal operators. In our proof, we will also frequently use the notions of Moreau envelopes and proximal operators, see e.g.



Efficient and Modular Implicit Differentiation

Neural Information Processing Systems

Automatic differentiation (autodiff) has revolutionized machine learning. It allows complex computations to be expressed by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention, with applications such as optimization layers and bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation has remained difficult for practitioners to use, as it often requires tedious case-by-case mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems.
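
To make the idea of differentiating an optimization solution concrete, here is a minimal hand-derived sketch. It is not the paper's library or API. It assumes a ridge regression problem, whose optimality condition is F(w, theta) = X^T (X w - y) + theta w = 0, and applies the implicit function theorem to get dw/dtheta = -(dF/dw)^{-1} dF/dtheta. The function names and the data are hypothetical.

```python
import numpy as np

def ridge_solution(X, y, theta):
    """Minimizer of 0.5*||Xw - y||^2 + 0.5*theta*||w||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + theta * np.eye(d), X.T @ y)

def implicit_grad(X, y, theta):
    """dw*/dtheta from the optimality condition F(w, theta) = X^T(Xw - y) + theta*w = 0."""
    d = X.shape[1]
    w = ridge_solution(X, y, theta)
    dF_dw = X.T @ X + theta * np.eye(d)  # Jacobian of F with respect to w
    dF_dtheta = w                        # partial derivative of F with respect to theta
    return -np.linalg.solve(dF_dw, dF_dtheta)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
theta = 0.7

g_implicit = implicit_grad(X, y, theta)

# Sanity check against a central finite difference of the solution map.
eps = 1e-6
g_fd = (ridge_solution(X, y, theta + eps) - ridge_solution(X, y, theta - eps)) / (2 * eps)
print(np.max(np.abs(g_implicit - g_fd)))
```

The only problem-specific input is the optimality condition F. The linear solve is generic, which is the kind of modularity the abstract describes.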


Supplemental Materials: A Consolidated Cross-Validation Algorithm for Support Vector Machines via Data Reduction. A Technical Proofs

Neural Information Processing Systems

C.2 Consolidated CV with random features. Alternatively, one can use random features (Rahimi and Recht, 2007) to approximate the kernel matrix. Suppose that we consider shift-invariant kernels that satisfy $K(x,y) = K(x - y)$. In this work we use the radial kernel $K(x,y) = \exp(-\sigma \|x - y\|_2^2)$. The kernel can be approximated by $K(x,y) \approx \langle \varphi(x), \varphi(y) \rangle$, where an explicit randomized feature mapping $\varphi: \mathbb{R}^p \to \mathbb{R}^m$ is obtained by sampling from a distribution defined by the inverse Fourier transformation.
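
The random feature approximation described above can be sketched as follows. This is an illustrative construction in the style of Rahimi and Recht (2007), not the authors' code. It assumes the cosine feature map with random phases, and, for the radial kernel exp(-sigma * ||x - y||^2), frequencies drawn from a Gaussian with variance 2 * sigma per coordinate. The function names and sample sizes are hypothetical.

```python
import numpy as np

def rff_map(X, m, sigma, rng):
    """Random Fourier features phi(x) in R^m with <phi(x), phi(y)> ~ exp(-sigma*||x-y||^2)."""
    p = X.shape[1]
    # Spectral density of exp(-sigma*||d||^2) is Gaussian with variance 2*sigma per coordinate.
    W = rng.normal(scale=np.sqrt(2.0 * sigma), size=(p, m))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)  # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

def rbf_kernel(X, sigma):
    """Exact radial kernel matrix K_ij = exp(-sigma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-sigma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sigma = 0.5

Phi = rff_map(X, m=20000, sigma=sigma, rng=rng)
K_exact = rbf_kernel(X, sigma)
K_approx = Phi @ Phi.T
max_err = np.max(np.abs(K_exact - K_approx))
print(max_err)  # shrinks roughly like 1/sqrt(m)
```

With the explicit features in hand, a linear method applied to `Phi` stands in for a kernel method applied to the full kernel matrix.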


Matrix Completion has No Spurious Local Minimum

Neural Information Processing Systems

Matrix completion is a basic machine learning problem that has wide applications, especially in collaborative filtering and recommender systems. Simple non-convex optimization algorithms are popular and effective in practice. Despite recent progress in proving that various non-convex algorithms converge from a good initial point, it remains unclear why random or arbitrary initialization suffices in practice. We prove that the commonly used non-convex objective function for positive semidefinite matrix completion has no spurious local minima: all local minima must also be global. Therefore, many popular optimization algorithms such as (stochastic) gradient descent can provably solve positive semidefinite matrix completion with arbitrary initialization in polynomial time. The result can be generalized to the setting when the observed entries contain noise. We believe that our main proof strategy can be useful for understanding geometric properties of other statistical problems involving partial or noisy observations.
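
As a small numerical illustration of the claim, the sketch below runs plain gradient descent from a random starting point on a non-convex objective for positive semidefinite matrix completion. It assumes the unregularized loss f(X) = 0.5 * sum over observed (i, j) of ((X X^T)_ij - M_ij)^2. The paper's exact objective may include additional terms, and the problem size, sampling rate, step size, and initialization scale are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p_obs = 30, 2, 0.6

# Ground-truth PSD matrix of rank r and a symmetric observation mask.
Z = rng.normal(size=(n, r))
M = Z @ Z.T
upper = np.triu(rng.random((n, n)) < p_obs)
mask = (upper | upper.T).astype(float)

def loss(X):
    R = mask * (X @ X.T - M)
    return 0.5 * np.sum(R**2)

def grad(X):
    R = mask * (X @ X.T - M)  # symmetric residual on observed entries
    return 2.0 * R @ X

X = 0.1 * rng.normal(size=(n, r))  # random initialization, no spectral warm start
initial_loss = loss(X)
for _ in range(5000):
    X -= 0.002 * grad(X)

final_loss = loss(X)
rel_err = np.linalg.norm(X @ X.T - M) / np.linalg.norm(M)
print(initial_loss, final_loss, rel_err)
```

On this instance the loss on the observed entries is driven close to zero and the unobserved entries are recovered as well, which is the behavior one would expect if no spurious local minima trap the iterates.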


Convex Two-Layer Modeling with Latent Structure

Neural Information Processing Systems

Unsupervised learning of structured predictors has been a long-standing pursuit in machine learning. Recently, a conditional random field auto-encoder has been proposed in a two-layer setting, allowing latent structured representations to be inferred automatically. Aside from being non-convex, it also requires the demanding inference of normalization. In this paper, we develop a convex relaxation of a two-layer conditional model which captures latent structure and estimates model parameters, jointly and optimally. We further expand its applicability by resorting to a weaker form of inference, namely maximum a posteriori. The flexibility of the model is demonstrated on two structures based on total unimodularity: graph matching and linear chains. Experimental results confirm the promise of the method.


The non-convex Burer-Monteiro approach works on smooth semidefinite programs

Neural Information Processing Systems

Semidefinite programs (SDPs) can be solved in polynomial time by interior point methods, but scalability can be an issue. To address this shortcoming, over a decade ago, Burer and Monteiro proposed to solve SDPs with few equality constraints via rank-restricted, non-convex surrogates. Remarkably, for some applications, local optimization methods seem to converge to global optima of these non-convex surrogates reliably. Although some theory supports this empirical success, a complete explanation of it remains an open question. In this paper, we consider a class of SDPs which includes applications such as max-cut, community detection in the stochastic block model, robust PCA, phase retrieval and synchronization of rotations.


Smoothed analysis of the low-rank approach for smooth semidefinite programs

Neural Information Processing Systems

We consider semidefinite programs (SDPs) of size $n$ with equality constraints. In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix $Y$ of size $n \times k$ such that $X = YY^\top$ is the SDP variable. The advantages of such a formulation are twofold: the dimension of the optimization variable is reduced, and positive semidefiniteness is naturally enforced. However, optimization in $Y$ is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, provided $k$ is large enough, for almost all cost matrices, all second-order stationary points (SOSPs) are optimal. Importantly, in practice, one can only compute points which approximately satisfy necessary optimality conditions, leading to the question: are such points also approximately optimal? To answer it, under similar assumptions, we use smoothed analysis to show that approximate SOSPs for a randomly perturbed objective function are approximate global optima, with $k$ scaling like the square root of the number of constraints (up to log factors). Moreover, we bound the optimality gap at the approximate solution of the perturbed problem with respect to the original problem.
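
The factorized approach can be sketched on a concrete instance. The example below assumes the max-cut-style SDP: minimize <C, X> subject to diag(X) = 1 and X positive semidefinite, so that the constraint on Y is that each row has unit norm (a product of spheres). It uses simple projected gradient steps with row renormalization. The solver, step size, and the choice k = ceil(sqrt(2n)) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def normalize_rows(Y):
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def objective(C, Y):
    # <C, Y Y^T> computed without forming the n x n matrix.
    return np.sum((C @ Y) * Y)

def burer_monteiro(C, k, steps=2000, lr=0.01, seed=0):
    """Minimize <C, YY^T> over n x k matrices Y with unit-norm rows."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    Y = normalize_rows(rng.normal(size=(n, k)))
    init_obj = objective(C, Y)
    for _ in range(steps):
        G = 2.0 * C @ Y                                # Euclidean gradient
        G -= np.sum(G * Y, axis=1, keepdims=True) * Y  # project onto tangent space of the spheres
        Y = normalize_rows(Y - lr * G)                 # step, then retract to unit rows
    return Y, init_obj

rng = np.random.default_rng(1)
n = 20
A = rng.normal(size=(n, n))
C = (A + A.T) / 2.0
k = int(np.ceil(np.sqrt(2 * n)))  # k on the order of sqrt(number of constraints)

Y, init_obj = burer_monteiro(C, k)
X = Y @ Y.T
final_obj = objective(C, Y)
lower_bound = n * np.linalg.eigvalsh(C)[0]  # <C, X> >= lambda_min(C) * trace(X)
print(init_obj, final_obj, lower_bound)
```

Feasibility comes for free: X = Y Y^T is positive semidefinite by construction and its diagonal equals one because the rows of Y are normalized, so only the non-convexity in Y remains to be handled.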